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Abstract 

Most  artificial  intelligence  systems  today  work  on  simple  problems  and  artificial 
domains  because  they  rely  on  the  accurate  sensing  of  the  task  world.  Object 
recognition  is  a  crucial  part  of  the  sensing  challenge  and  machine  learning  stands 
in  a  position  to  catapult  object  recognition  into  real  world  domains.  Given  that, 
to  date,  machine  learning  has  not  delivered  general  object  recognition,  we  pro¬ 
pose  a  different  point  of  attack:  the  learning  architectures  themselves.  We  have 
developed  a  method  for  directly  learning  and  combining  algorithms  in  a  new  way 
that  imposes  little  burden  on  or  bias  from  the  humans  involved.  This  learning 
architecture,  PADO,  and  the  new  results  it  brings  to  the  problem  of  natural  image 
object  recognition  is  the  focus  of  this  report. 


^This  research  was  sponsored  by  the  Carnegie  Mellon  School  of  Computer  Science 


Aoa«saloa  ¥0** 

GM8i  ^ 

1  DTIC  “S-AB  u 

i  ^ 

1  Justlf icstlon-^ 


By 

an'.4/%2’ 
Spsoissl* 


Keywords:  PADO,  Genetic  Programming,  Object  Recognition,  Evolution,Parallel 
Algorithms,Incremental  Learning,Natural  Images, Greyscale  Video  Images,Libraries, Orchestration 


1 


1.  Introduction 

In  general,  AI  systems  use  symbols  to  represent  knowledge  and  to  reason  about 
tasks.  Most  of  these  systems  today  stiU  work  on  simple  problems  and  artificial 
domains.  One  of  the  main  reasons  for  this  is  the  common  assumption  that  sensing 
is  not  only  perfect,  but  also  that  sensors  return  specific  symbols,  not  raw  data. 
The  signal- to-symbol  problem  is  the  task  of  converting  raw  sensor  data  into  a  set 
of  symbols  that  the  data  can  be  seen  as  representing. 

One  of  the  main  goals  of  computer  vision  is  to  provide  a  solution  to  the  signal- 
to-symbol  problem.  In  particular,  this  goal  involves  object  recognition,  i.e.  the 
abifity  to  recognize  what  objects  are  shown  in  an  image.  Machine  learning  can  do 
induction  on  a  set  of  examples  to  learn  to  discriminate  among  classes.  These  two 
fields,  machine  learning  and  computer  vision,  are  natural  mates  and  are  particu¬ 
larly  suited  to  cooperate  on  object  recognition  tasks. 

Several  proven  machine  learning  architectures,  such  as  neural  networks,  have 
been  integrated  with  computer  vision  and  the  results  in  the  recognition  of  “ev¬ 
eryday”  objects  have  been  modest  at  best.  It  is  possible  that  better  parameter 
values,  more  training  data,  or  faster  computers  wiU  allow  one  of  these  architec¬ 
tures  to  make  some  significant  advance  in  the  field  of  object  recognition.  This 
report  proposes  a  different  view;  that  given  the  experienced  difficulties  with  the 
current  architectures,  a  more  profitable  path  is  to  investigate  new  architectures. 

Clearly,  any  solution  to  the  object  recognition  problem  needs  to  be  grounded 
in  an  algorithm  that  processes  intensity  values  from  an  image  signal.  Consider  the 
particular  task  of  differentiating  between  many  different  images  of  natural  objects 
in  natural  settings.  This  task  can  be  solved  by  learning  a  separate  algorithm  for 
discriminating  image  signals  of  each  object  class. 

Our  new  architecture,  PADO,  is  a  technique  for  learning  these  algorithms 
directly,  so  that  there  is  no  built-in  commitment  to  the  manner  in  which  the  algo¬ 
rithm  investigates  the  image  and  arrives  at  a  decision.  No  features  are  chosen  for 
PADO  and  no  attention  focusing  strategy  is  built  in.  PADO  {Parallel  Algorithm 
'Discovery  and  Orchestration)  uses  an  evolutionary  strategy  to  accomplish  these 
feats  of  self-generation.  The  motivation  for  PADO,  the  details  of  PADO,  and 
results  on  a  challenging  vision  problem  are  the  topic  of  this  report. 

A  challenging  vision  problem  in  object  recognition  can  be  found  in  high  res¬ 
olution,  noisy  images  of  real  world  objects  in  natural  settings.  In  the  course  of 
this  report,  such  an  image  set  will  be  introduced  as  an  example  domain  in  which 
PADO  can  learn  and  perform.  PADO’s  impressive  performance  on  this  difficult 
recognition  problem  will  then  be  repeated  in  a  second  vision  domain  with  different 
characteristics. 

This  report  will  provide  a  solid  basis  for  understanding  this  unique  architecture 
and  why  it  works.  The  experimental  results  wiU  show  PADO’s  promise  as  a 
new,  practical  solution  to  the  general  object  recognition  problem.  The  PADO 
architecture  is  entirely  independent  of  the  signal  type  to  be  classified  and  this 
construction  will  be  seen  to  promise  similar  results  on  signal  types  as  varied  as 
sonar,  speech,  and  text.  This  report  is  both  a  dissection  of  a  tested  approach  and 
the  introduction  of  a  new  way  of  doing  signal  understanding. 
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2.  Motivation 

The  Signal-to-Symbol  Problem  is  a  major  problem  in  AI,  both  in  its  own  right, 
and  because  many  other,  more  symbolic  parts  of  AI  need  to  be  given  symbols, 
rather  than  raw  data. 

As  introduced  above,  PADO  is  designed  to  attack  the  general  signal-to-symbol 
problem.  Why  pick  vision  as  PADO’s  first  test  domain?  The  real  reason  is  that 
vision  is  the  task  AI  most  badly  needs  to  have  solved.  The  two  most  obvious 
examples  of  agents  performing  the  signal  to  symbol  translation  are  in  hearing 
and  seeing.  Humans  and  other  animals  seem  to  effortlessly  take  in  information 
which  is  not  unfairly  represented  as  1  or  2  dimensional  waves  and  very  quickly 
and  without  conscious  effort  determine  some  correspondence  between  these  raw 
perceptions  and  notions  about  the  real  world.  In  humans,  we  call  these  notions 
symbols. 

Extracting  symbols  from  vision  and  hearing  are  both  hard  problems.  But 
relative  to  the  ultimate  goal  of  human  level  performance,  it  is  clear  that  AI  has 
had  more  success  engineering  solutions  to  the  sound  problem  than  to  the  vision 
problem.  So  because  it  is  important  to  AI  and  because  it  is  still  a  mostly  unsolved 
field,  PADO’s  first  problem  was  chosen  to  be  vision. 

As  will  be  detailed  in  the  next  section,  learning  is  relatively  new  to  the  field 
of  computer  vision  and  has  brought  with  it  only  modest  improvements  in  most 
areas.  Given  that  learning  in  vision  has  had  this  modest  success  rate  in  uncon¬ 
strained  problems,  we  suggest  that  too  much  effort  is  being  concentrated  on  the 
effort  to  scale  up  these  architectures  so  that  they  work  better  or  with  less  human 
intervention  (i.e.  preprocessing). 

After  aU,  Neural  Networks  (NN)  as  an  example,  were  not  designed  to  model  our 
visual  cortex  or  our  brains.  They  were  intended  to  be  like  very  small  pieces  of  our 
brain.  There  is  no  good  reason  to  think  that  the  large  amount  of  processing  that 
has  to  go  on  in  order  to  do  general  object  recognition  can  be  done  by  a  NN  with  a 
few  hundred  inputs  or  one  with  at  most  a  few  tens  of  hidden  units.  Even  ignoring 
this,  it  is  not  obvious  that  NNs  wiU  be  easy  to  train  on  a  function  as  complex  as 
the  mapping  from  images  to  classes.  Instead,  we  designed  an  architecture  that 
directly  addresses  some  of  these  issues. 

This  report  is  organized  as  follows.  Section  3  discusses  related  work  in  the 
areas  of  object  recognition  and  genetic  programming.  Section  4  provides  a  short 
introduction  to  the  learning  power  of  evolution.  Section  5  details  the  PADO 
architecture  and  the  language  choices  made  for  our  experiments.  Section  6  gives 
the  experimental  background,  set  up,  and  example  pictures.  Section  7  shows 
results  on  a  series  of  experiments.  Section  8  is  a  series  of  discussions  on  some  of 
the  most  puzzling  points  that  this  report  brings  up.  Section  9  looks  forward  to 
the  work  in  progress  now  and  the  near  future  goals  we  have  for  this  work.  And 
finally,  Section  10  concludes  by  bringing  together  the  highlights  of  the  report. 

3.  Related  Work 

Object  Recognition  has  been  a  heavily  researched  area  for  almost  30  years.  Until 
recently  however,  the  objects  to  be  recognized  were  usually  highly  geometrical  in 
shape.  In  the  majority  of  cases,  the  recognizer  already  had  some  model,  either 
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explicit  (e.g.  CAD)[2]  or  implicit  (e.g.  hand-coded  features)[3],  of  what  the 
objects  were  like;  the  task  was  really  to  try  to  find  objects  in  the  picture  and 
correctly  identify  their  pose. 

The  demand  for  more  robust  systems  with  few  constraints,  coupled  with  the 
rising  cost  of  programmer  time  relative  to  computer  cycles  has  pushed  object 
recognition  and  machine  learning  together.  The  field  of  computer  vision  is  far 
too  big  even  to  sketch  here.  While  we  focus  on  vision  and  learning,  we  will  only 
claim  that  to  obtain  similar  results  to  those  presented  in  this  report,  a  hand-coded 
system,  if  at  all  feasible,  would  require  a  significant  amount  of  programmer  time. 

Learning  has  been  used  for  a  variety  of  purposes  with  respect  to  object  recog¬ 
nition.  Researchers  have  tried  to  learn  which  hand-coded  features  to  use  on  a 
particular  problem  [8].  Researchers  have  tried  to  learn  models  of  the  objects 
based  on  a  number  of  well  constrained  images  of  these  objects  and  then  use  these 
models  as  mentioned  above  [19]. 

Learning  has  been  applied  on  a  larger  level,  particularly  in  the  form  of  Neural 
Networks.  The  major  problem  with  using  Neural  Networks  is  that  with  today’s 
technology,  NNs  cannot  take  full  video  images  as  input.  Imagine  even  a  small 
image  with  256x256  resolution.  A  NN  with  just  16  hidden  units  fully  connected  to 
the  1/16  million  inputs  would  have  over  1  million  weights  to  fix.  Back  propagation 
would  take  a  long  time  to  train  such  a  net. 

When  preprocessing  can  be  done  to  significantly  reduce  the  images’  resolution 
while  preserving  the  relevant  information,  NNs  can  be  effective.  Of  course  this 
most  often  occurs  when  either  the  problem  is  not  difficult  or  where  the  preprocess¬ 
ing  is  very  clever.  When  the  preprocessor  has  been  made  very  clever,  the  problem 
has  not  really  been  solved,  but  simply  been  moved  to  the  problem  of  creating  this 
nice  preprocessor. 

Though  it  is  not  directly  concerned  with  object  recognition.  Dean  Pomerleau’s 
work  on  driving  the  ALVINN  using  a  NN  is  an  example  of  a  very  successful 
application  of  NN’s  to  a  computer  vision  problem  [10].  Here  the  preprocessor  was 
a  little  clever  and  the  problem  itself  was  not  too  hard.  This  work  is  important 
largely  because  Dean  was  one  of  the  first  to  apply  learning  to  this  kind  of  real 
time  reactive  vision  problem. 

Thrun’s  and  Mitchell’s  work  using  NNs  to  do  visual  object  recognition  is  an¬ 
other  good  example  of  work  with  goals  similar  to  our  own  [18].  They  take  video 
images  and  preprocess  them  to  get  low  resolution  images  which  are  then  given  to 
a  NN  for  object  recognition  training.  Their  work  focuses  on  studying  the  effect  of 
lifelong  learning  on  the  ability  to  find  general  object  type  invariants.  This  work 
is  mentioned  here  not  only  because  it  is  has  some  experimental  similarities  to  our 
work,  but  because  we  have  used  their  data  for  some  of  the  experiments  discussed 
in  this  report. 

Because  part  of  the  PADO  system  is  a  type  of  genetic  programming  (GP), 
there  should  be  some  mention  here  of  the  sort  of  vision  related  work  that  has 
been  done  within  that  paradigm. 

As  far  as  we  know  there  are  no  published  results  of  the  sort  discussed  in  this 
report:  that  is,  none  that  apply  genetic  programming  or  genetic  algorithms  di¬ 
rectly  to  fuU  video  images  and  do  object  recognition  on  the  basis  of  that  input. 
The  work  that  has  been  done  seems  to  fall  into  two  major  categories:  bitmap 
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recognition  and  learned  aids  for  vision  problems  (including  object  recognition). 
There  have  been  some  examples  of  genetic  programming  applied  to  bitmaps  (usu¬ 
ally  font  bitmaps)  in  order  to  do  classification  (  [7],  [1]).  In  between  there  are 
works  like  [4]  that  applied  GP  to  a  restricted  subset  of  a  black/ white  silhouette 
of  a  person  and  tried  to  learn  where  one  of  the  hands  was.  Learned  aids  to  object 
recognition  can  be  seen  in  works  like  [11]  and  [9].  For  example,  in  [11],  GP  is  used 
to  improve  the  performance  of  an  army  system  for  locating  tanks  by  learning  to 
choose  from  among  existing  system  components. 

There  is  a  significant  amount  of  work  that  is  related  to  PADO,  but  no  single 
piece  of  work  or  combination  of  several  even  partially  overlaps  with  the  details  of 
the  PADO  architecture.  The  method  of  orchestration  and  parallel  execution  of 
learned  algorithms  for  signal  classification  has,  until  now,  been  unexplored. 

4.  Evolution  for  Inductive  Generalization 

Evolutionary  computation  is  biologically  motivated.  In  nature,  we  see  that  the 
combination  of  survival  of  the  fittest,  fitness  proportionate  reproduction,  and 
genetic  recombination  is  an  extremely  powerful  tool  for  finding  solutions  to  bi¬ 
ological  problems.  In  this  section  we  introduce  the  basic  nature  of  such  genetic 
evolutionary  processes. 

Suppose  that  we  have  a  large 
group  of  things,  and  some  measure  of 
how  good  a  thing  is.  We  can  apply 
this  measure  to  each  of  these  things 
and  get  an  approximate  or  exact  fit¬ 
ness  for  each  thing.  Now  suppose  that 
we  allow  each  thing  to  be  represented 
in  a  new  group  of  things  in  propor¬ 
tion  to  its  fitness  relative  to  the  other 
things  m  the  group.  The  best  things  m 
the  old  group,  are  likely  to  have  mul¬ 
tiple  representation  in  the  new  group 
and  the  worst  things  in  the  old  group 
are  likely  to  have  no  representation  in 
the  new  group.  When  the  new  group 
is  fixed  to  be  the  same  size  as  the  old 
group  this  scheme  accomplishes  both 
survival  of  the  fittest  and  fitness  pro¬ 
portionate  reproduction.  If  nothing 
else  changed  between  each  successive 
group,  the  current  group  would  soon 
be  filled  with  many  instances  of  the 
most  fit  thing  in  the  group. 

Suppose  that  before  measuring  the  fitness  of  each  thing  in  the  new  group,  we 
change  some  of  them  in  a  random  or  semi-random  way.  This  is  the  most  general 
form  of  “genetic  recombination.”  These  changes  introduce  some  chance  that  one 
of  the  new  things  will  have  higher  fitness  than  any  of  the  old  things.  After  many 
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new  groups  have  come  and  gone  we  can  expect  that  the  best  thing  in  the  current 
group  will  be  much  better,  according  to  our  measure,  than  any  of  the  things  that 
were  in  the  original  group.  That  is  the  concept  of  evolution.  (See  Chart  above). 

Evolutionary  computation  is  a  form  of  best-first  search.  Exponentially  increas¬ 
ing  representation  is  given  to  those  things  that  have  highest  fitness  and  so  those 
points  in  the  space  are  exponentially  more  likely  to  be  examined  next,  relative  to 
the  other  points  under  consideration  (i.e.  the  other  things  in  the  group)  [12,  13]. 

In  the  vocabulary  of  evolutionary  computation,  a  group  of  things  is  called  a 
population.  To  distinguish  one  population  from  another  they  are  referred  to  as 
generations.  The  initial  population  is  traditionally  called  “Generation  0”  and 
each  successive  generation  is  numbered  in  increasing  integer  order. 

The  exact  structure  of  a  thing  varies  from  field  to  field  in  evolutionary  com¬ 
putation.  In  genetic  algorithms  things  are  called  alleles  and  take  the  form  of  bit 
strings  [5].  In  genetic  programming  things  are  called  functions  and  take  the  form 
of  Lisp-like  nested  primitive  function  calls  [6].  In  PADO  they  will  be  referred  to 
as  programs.  The  form  they  take  will  also  be  represented  in  a  Lisp-like  syntax, 
but  this  is  only  a  partial  representation  of  what  the  program  is.  (See  Appendix  1 
for  an  example  PADO  program). 

Genetic  recombinations  come  in  many  dilferent  varieties.  The  two  most  com¬ 
mon  and  the  two  which  are  directly  relevant  to  this  report  are  crossover  and 
mutation.  In  crossover  two  things  are  chosen  and  one  subpart  from  each  is  se¬ 
lected.  Then  these  two  subparts  are  exchanged  and  these  two  new  things  are 
placed  back  in  the  population.  In  mutation,  one  thing  is  chosen  and  one  subpart 
is  selected.  This  subpart  is  changed  in  some  random  way  and  this  new  thing  is 
placed  back  in  the  population.  In  both  cases  the  syntax  of  the  things  is  usually 
constrained  so  that  these  changes  always  produce  legal  new  things. 

The  most  crucial  aspect  of  evolutionary  computation  is  that  it  is  not  a  random 
search  of  the  space  of  things.  Because  there  is  a  correlation  between  syntactic  and 
functional  similarity,  these  recombinations  explore  things  which  are  likely  to  be 
similar  in  their  fitness.  This  correlation  is  the  essence  of  hill-climbing.  Combine 
this  version  of  hillclimbing  with  the  aspects  of  best-first  search  already  mentioned 
and  we  have  a  powerful  tool  for  searching  almost  any  space  whose  decomposable 
elements  have  some  fitness  variation. 

5.  PADO  Architecture 

Part  of  the  PADO  architecture  falls  under  the  general  heading  of  evolutionary 
computation.  This  section  will  discuss  the  way  PADO  works  and  how  its  central 
component  is  an  instance  of  the  general  scheme  that  was  described  in  the  previous 
section.  During  the  first  part  of  this  section  the  inputs  will  be  considered  arbitrary 
signals.  Later  in  the  section  and  then  for  the  rest  of  the  report,  the  specific  signal 
type  of  a  still  video  image  wiU  be  used  as  an  example.  But  because  the  PADO 
architecture  was  designed  to  apply  to  any  signal  type,  that  is  how  it  will  be 
introduced. 

The  goal  of  the  PADO  architecture  is  to  learn  to  take  signals  as  input  and 
output  correct  labels.  When  there  are  C  classes  to  choose  from,  PADO 
starts  by  learning  C  different  systems.  System  J  is  responsible  for  taking 
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a  signal  as  input  and  returning  a  confidence  that  class  1  is  the  correct  label. 
Clearly,  if  all  C  systems  worked  perfectly,  labeling  each  signal  correctly  would  be 
as  simple  as  picking  the  unique  non-zero  confidence  value.  If,  for  example,  system 
J  returned  a  non-zero  confidence  value,  then  the  correct  label  would  be  J .  In 
the  real  world,  none  of  the  C  systems  will  work  perfectly.  This  leads  us  to  the 
recurring  two  questions  of  the  PADO  architecture; 

1.  “How  does  PADO  learn  good  components?” 

2.  “How  does  PADO  orchestrate  them  for  maximum  effect?” 

We  will  explain  how  PADO  orchestrates  these  systems  in  Section  7.3.  Now,  let’s 
delve  into  how  one  of  these  systems  is  built. 

System  J  is  built  out  of  several  programs.  Each  of  these  programs  does  exactly 
what  the  system  as  a  whole  does:  it  takes  a  signal  as  input  and  returns  a  confidence 
value  that  label  X  is  the  correct  label.  The  reason  for  this  seeming  redundancy  will 
be  justified  and  discussed  in  Section  10.  PADO’s  orchestration  of  these  programs 
into  a  single  system  will  be  discussed  in  Section  7.3. 

PADO  evolves  these  programs  along  the  general  lines  described  in  the  previous 
section.  Programs  learned  by  PADO  are  written  in  a  functional  language  that  is 
PAD  0-specific.  During  the  training  phase  of  learning,  these  programs  are  inter¬ 
preted,  not  compiled.  So  like  Lisp,  the  programs  can  be  compiled  or  interpreted, 
but  during  the  “construction”  phase  they  are  simply  interpreted. 

At  the  beginning  of  a  learning  session,  the  main  population  is  filled  with  V 
programs  that  have  been  randomly  generated  using  a  grammar  for  the  legal  syntax 
of  the  language.  All  programs  in  this  language  are  constrained  by  the  syntax  to 
return  a  number  that  is  interpreted  as  a  confidence  value  between  some  minimum 
confidence  (MinConf)  and  some  maximum  confidence  (MaxConf). 

At  the  beginning  of  a  new  generation,  each  program  in  the  population  is 
presented  with  T  training  signals  and  the  T  confidences  it  returns  are  recorded. 
Then  the  population  is  divided  into  C  different  groups  of  size  V  jC.  The  programs 
in  group  X  are  the  VjC  programs  that  recognized  class  X  better  than  any  other 
class  in  the  sense  that  they  maximized  a  reward  function  Reward  when  K  =  X 
(A  is  the  class  to  which  PADO  is  considering  assigning  program  C?). 

/*  R  is  the  reward  and  C  is  the  number  of  classes.  Guess[i[/][j]  is  the 
confidence  program  V  returned  for  image  j.  ObjectClass[j]  is  the  object 
type  that  appears  in  image  j.  */ 

int  Reward(class  K,  int  Guess[  ]) 

A  =  0; 

Loop  j  =  1  to  MaxResponses 

If  (/(f  =  OhjectClass[j])  Then 
R  =  R  +  {{C  —  1)  =(=  Guess[U][j]y, 

Else 

R  =  R  —  Guess[U][j]', 

return  R; 
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On  images  that  the  program 
should  return  MaxConf  for,  the  re¬ 
ward  is  multiplied  by  C  —  1  so  that, 
even  though  this  only  happens  once  in 
C  times,  these  images  will  account  for 
half  the  reward.^ 

Each  group  is  then  sorted  by  in¬ 
creasing  fitness  and  each  program  is 
ranked  accordingly.  A  new  total  pop¬ 
ulation  is  created  by  putting  a  copy  of 
Program!  in  the  new  population  with 
probability  2  =t=  rank(X)/{PfC).  The 
expected  number  of  copies  of  the  best 
program  in  Groupi  is  2,  the  expected 
number  of  copies  of  the  median  pro¬ 
gram  is  1,  and  the  expected  number  of 
copies  of  the  worst  program  in  Groupj 
is  2/(P/C).  This  is  fitness  proportion¬ 
ate  reproduction. 

The  Libraries  are  functions  avail¬ 
able  to  all  programs.  After  the  di¬ 
vision  of  the  population,  the  libraries 
are  updated  according  to  how  widely 
and  how  often  they  were  used.  These 
statistics  are  weighted  by  the  fitnesses 
of  the  programs  that  called  them. 

Finally,  a  large  percent  of 
the  new  total  population  is  sub¬ 
jected  to  crossover  and  another, 
much  smaller  percent  is  subjected  to 
mutation. ^Crossover  in  PADO  is  more 
complicated  than  its  standard  form 
in  genetic  algorithms  or  genetic  pro¬ 
gramming.  In  PADO  two  programs 
are  chosen  and  given  to  a  “SMART 
crossover”  algorithm. 

This  algorithm  examines  the  two  programs  and  decides  where  they  should  be 
crossed  over.  Then  the  two  subparts  of  the  programs  chosen  by  the  algorithm  are 
switched,  creating  two  new  programs.  These  two  new  programs  replace  the  two 
old  programs  in  the  population.  Mutation  in  PADO  works  very  much  as  it  does 
in  the  general  case  described  in  the  previous  section.  One  program  is  chosen  and 
a  randomly  chosen  subpart  is  replaced  with  a  randomly  generated  subpart.  This 
changed  program  then  replaces  the  old  program  in  the  new  total  population. 

^This  seems  reasonable  since  it  should  be  as  important  to  say  YES  when  appropriate  as  to 
say  NO  when  appropriate  since  these  two  cases  are  respectively  coverage  and  accuracy. 

^The  implementation  details  of  these  exact  percentages  and  the  reasons  why  they  are  not 
equal  are  largely  traditional  and  affect  the  speed,  but  not  the  results  in  this  report. 


Initialize  Population  with  Random  PROGRAMS 

f 

New  Population 


Measure  Confidence-Reaction  of  each 
program  on  a  Set  of  Training  images 

V 

Separate  the  Main  Population  into  C  pools 
according  to  the  programs’  respective  talents 


Extract  Best  from  each  Pool 
for  Object  Recognition 


Create  C  Mating  Pools  using  Fitness 
Proportionate  Reproduction. 


Update  Libraries  according  to 
Fitness-Weighted-Use. 


SMART  Operators  choose  Recombination  Sites 


Apply  Genetic  Recombination  to  each 
Mating  Pool 


Mating  Pools  are  Merged  into  New  Population  - 
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At  this  point  we  have  a  new  population  and  the  process  of  evaluation,  re¬ 
production,  and  recombination  repeats.  After  many  generations  we  find  that  the 
best  programs  in  the  population  are  much  better  than  any  that  were  created 
(randomly)  at  the  start  of  the  process. 

To  extract  programs  to  use  in  the  systems,  we  can  pause  the  process  after  the 
evaluation  step  of  a  generation  and  copy  out  those  programs  that  scored  best  or 
near  best  in  each  group  J.  So  this  architecture  is  an  anytime  learning  system:  at 
any  time  we  can  generate  a  system  for  signal  classification  using  what  we  have 
learned  so  far. 

5.1.  The  Language  of  a  PADO  Program 

Each  PADO  program  is  made  up  of 
three  important  parts:  a  main  loop, 
an  ADF,  and  an  Indexed  Memory. 

Both  the  main  loop  and  the  ADF 
(Automatically  Defined  Function)  are 
written  in  a  PAD  0-specific  functional 
language.  The  main  loop  is  repeat¬ 
edly  applied  for  a  fixed  time  limit.  A 
weighted  average  of  the  responses  the 
program  gave  on  each  iteration  is  com¬ 
puted  and  interpreted  as  the  answer. 

The  weight  of  a  response  at  time 
is  i.  Later  responses  count  more  to¬ 
wards  the  total  response  of  the  pro¬ 
gram.  PADO’s  programs  are  guaran¬ 
teed  to  halt  and  respond  in  a  fixed 
amount  of  time. 

The  Indexed  Memory  is  an  array  of  integers  indexed  by  the  integers.  As  will 
be  seen  below,  each  program  has  the  abiUty  to  access  any  element  of  its  memory, 
either  to  read  from  it  or  to  write  to  it  [14].  This  memory  scheme,  in  conjunction 
with  the  main  loop  described  above  has  been  shown  to  be  Turing  complete  [16]. 
In  practice  this  memory  has  a  finite  range  of  the  integers  over  which  it  is  indexed 
and  each  element  can  hold  integers  in  a  finite  range.  However,  indexed  memory 
can  been  seen  as  the  simplest  memory  structure  that  can  practically  support  all 
other  memory  structures.  Indeed,  indexed  memory  has  been  successfully  used  to 
build  up  complex  mental  models  of  local  geography  [14]. 

The  ADF  is  a  function  definition  that  evolves  along  with  the  main  loop.  This 
ADF  may  be  called  as  often  as  desired  in  the  main  loop  but  may  not  call  itself 
[7].  While  each  program  has  a  private  main  loop,  a  private  ADF,  and  a  private 
indexed  memory,  there  are  a  number  of  Library  functions  that  may  be  called  by 
the  entire  population  of  programs. 

Like  every  functional  language,  PADO  programs  are  composed  of  nested  func¬ 
tions.  These  functions  act  on  the  results  of  the  functions  nested  inside  them 
and  any  terminals  that  make  up  their  parameters.  Terminals  are  the  zero  arity 
functions  hke  constants  and  variables. 
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In  the  main  loop,  the  terminals  are  the  integer  values  0  thru  255.  In  the  ADF 
and  hbrary  functions,  the  terminals  are  the  integer  values  0  thru  255  plus  the 
parameters  X,  Y,  U,  and  V  (see  below). 

Here  is  a  brief  summary  of  the  language  primitives  and  their  effects;^ 

Algebraic  Primitives 

(ADD  X  Y),(SUB  X  Y),(MULT  X  Y),(DIV  X  Y),(NOT  X),(MAX 

X  Y),(MIN  X  Y)  :  These  functions  allow  basic  manipulation  of  the  inte¬ 
gers.  AU  values  are  constrained  to  be  in  the  range  0  to  255.  So  numbers  are 
rounded  up  or  down  as  necessary.  For  example,  (DIV  X  0)  is  defined  to  be 

255  and  (NOT  X)  maps  the  set  1..255  to  0  and  0  to  1. 

Memory-access  Primitives 

(READ  X),( WRITE  X  Y)  :  These  two  functions  access  the  memory  of 
the  program.  Each  program  has  a  memory  which  is  organized  as  an  array  of 

256  integers  that  can  take  on  values  between  0  and  255.  (READ  X)  returns 
the  integer  stored  in  position  X  of  the  memory  array.  (WRITE  X  Y)  takes 
the  value  X  and  writes  it  into  position  Y  of  the  indexed  memory.  WRITE 
returns  the  OLD  value  of  position  Y  (i.e.  a  WRITE  is  a  READ  with  a  side- 
effect).  The  memory  is  cleared  (all  positions  set  to  zero)  at  the  beginning 
of  a  program  execution  and  then  during  that  time  all  writes  persist  unless 
overwritten  by  a  newer  write  to  the  same  position. 

Branching  Primitives 

(IF-THEN-ELSE  X  Y  Z), (EITHER  X  Y  Z)  :  These  two  are  short- 
circuit  branches.  Either  the  Y  or  Z  subtree  is  evaluated  and  returned,  but 
the  other  is  never  even  evaluated.  This  is  an  important  distinction  when 
subtrees  can  have  side  effects  like  WRITE.  (IF-THEN-ELSE  X  Y  Z)is  a 
deterministic  branch.  If  X  evaluates  to  a  non-zero  number  then  Y  is  evalu¬ 
ated  and  returned,  otherwise  Z  is  evaluated  and  returned.  (EITHER  X  Y 
Z)  is  a  non-deterministic  branch.  A  number  between  0  and  255  is  chosen 
at  random.  If  the  number  is  less  than  X  then  Y  is  evaluated  and  returned, 
otherwise  Z  is  evaluated  and  returned. 

Signal  Primitives 

(PIXEL  X  Y), (REGION  X  Y), (LEAST  XI  Y1  X2  Y2)  ,(MOST 

XI  Y1  X2  Y2), (AVERAGE  XI  Y1  X2  Y2), (VARIANCE  XI  Y1 

X2  Y2)  :  These  are  the  language  functions  that  can  access  the  image  data. 
PIXEL  returns  the  intensity  value  at  that  point  in  the  image.  REGION 
returns  the  Gaussian  sampled  value  at  that  point  with  a  standard  deviation 
of  5  pixels.  The  other  four  return  the  respective  functions  applied  to  the 
rectangle  in  the  image  that  the  four  parameters  specify.  (The  smaller  of  the 
two  X  values  is  interpreted  as  XupperLeft  a-nd  the  larger  is  interpreted  as 
XhowerRight-  The  Same  is  done  for  the  Y  values  and  so  the  four  parameters 
always  specific  a  legal  rectangle  in  the  image). 

portion  of  the  discussion  section  will  be  devoted  to  a  justification  of  this  language  design 
and  its  ramifications 
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Routine  Primitives 

(ADF  X  Y  U  V)  :  ADF  stands  for  Automatically  Defined  Function.  This 
function  is  private  to  the  individual  that  uses  it  and  can  be  called  as  many 
times  as  desired  from  the  main  loop.  Each  individual  has  exactly  one  ADF 
which  evolves  along  with  the  main  loop.  The  ADF  differs  from  the  main 
loop  in  three  ways.  It  may  not  call  ADF,  it  may  not  call  Library  functions, 
and  it  has  4  extra  legal  terminals:  X,  Y,  U,  and  V.  These  extra  termainals 
are  local  variables  that  take  on  the  values  of  the  four  sub-expressions  that 
were  used  in  each  particular  call  to  ADF  from  the  main  loop. 

(LIBRARY[i]  X  Y  U  V)  :  There  are  100  library  functions.  The  i  is  not 
really  a  parameter.  Instead  a  call  to  a  Library  function  from  some  program’s 
main  loop  might  look  Uke  (Library57  56  (ADD  1  99)  0  (WRITE  3  19)).  All 
100  library  functions  are  available  to  all  programs  in  the  population.  How 
these  library  functions  are  created  and  changed  will  be  discussed  in  Section 
10. 

6.  The  Experiments 

The  discussion  of  results  will  focus  on  two  sets  of  data.  One  set  was  taken  by 
us.  The  other  set  was  taken  by  Sebastian  Thrun  who  is  also  working  in  machine 
vision  [18].  For  reference  purposes,  we  will  call  the  first  set  A  and  the  second  set 

T. 

Set  T  has  seven  classes:  Book,  Bottle,  Cap,  Coke  Can,  Glasses,  Hammer,  and 
Shoe.  The  lighting,  position  and  rotation  of  the  objects  varies  widely.  The  floor 
and  wall  behind  and  underneath  the  objects  are  constant.  Nothing  else  except 
the  object  is  in  the  image.  However,  the  distance  from  the  object  to  the  camera 
ranges  from  1.5  to  4  feet  and  there  is  often  severe  foreshortening  of  the  objects  in 
the  image.  See  columns  one  and  two  of  Figure  1  for  sample  images. 

Set  A  has  seven  classes:  Nothing,  Bear,  Long  flute.  Pan  flute,  Thermos,  Book, 
and  Racket.  The  class  Nothing  is  a  collection  of  images  which  are  empty  or  show 
a  hand  holding  some  object  (like  a  cup  or  a  ball)  that  is  not  one  of  the  other 
six  classes.  All  pictures  are  taken  against  a  variety  of  solid  colored  backgrounds 
and  contain  part  of  a  hand  and  arm.  The  hand  holds  one  of  the  objects,  often 
partially  occluding  it.  The  location  and  rotation  of  the  object  is  only  constrained 
so  that  the  object  is  completely  in  the  image.  The  lighting  varies  dramatically  in 
intensity  and  position.  The  distance  from  the  object  to  the  camera  ranges  from 
2.5  to  3.5  feet.  The  objects  are  never  severely  foreshortened.  See  columns  three 
and  four  of  Figure  1  for  sample  images. 

Set  A  was  created  with  several  criteria  in  mind.  We  wanted  a  set  of  images  that 
could  be  easily  distinguished  by  people,  but  were  not  trivially  different  in  some 
way  that  the  computer  could  “notice”.  We  wanted  some  noise  in  the  images  but 
didn’t  want  to  have  unconstrained  backgrounds.  We  wanted  a  sufficient  number 
of  classes  so  that  the  results  could  give  some  indication  of  PADO’s  practical  value. 
However,  it  was  important  that  the  number  of  classes  be  small  enough  that  the 
learning  and  science  was  not  lost  in  the  engineering. 

As  a  result,  we  chose  grey  scale  images  of  the  seven  classes  listed  above  for  set 
A.  The  backgrounds  varied  in  color  but  were  always  solid.  The  noise  was  always 


Sample  Train  Images  Sample  Test  Images 


Figure  1:  2S  randomly 
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Sample  Train  Images  Sample  Test  Images 


images  from  Sets  T  and  A. 
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there  in  the  form  of  a  hand  and  arm.  In  general  the  hand  and  arm  takes  up  about 
half  as  much  area  of  the  image  as  the  object  itself  so  that  the  signal  to  noise  ratio 
is  between  1.0  and  2.0.  And  we  chose  to  test  the  system  on  seven  classes.  Initial 
tests  proved  PADO’s  effectiveness  in  a  three  class  object  recognition  problem,  so 
slightly  more  than  doubling  the  number  of  classes  seemed  like  the  next  qualitative 
step  up  in  difficulty. 

Set  T  was  created  by  Sebastian  Thrun  for  his  own  work.  The  pictures  he  took 
were  originally  in  color,  but  allowing  PADO  access  to  the  color  images  turned  out 
to  be  too  easy  (PADO  classification  accuracy  of  95%)  so  we  removed  the  color  and 
saturation  information  and  kept  only  the  brightness  information.  We  thought  it 
was  important,  perhaps  even  more  important,  to  include  data  on  images  taken  by 
someone  else  for  a  different  purpose  as  this  is  the  really  goal  of  the  PADO  project. 
That  is,  the  ability  to  distinguish  between  classes  of  signals  that  were  in  no  way 
designed,  taken,  or  preprocessed  with  PADO  in  mind. 

Because  these  images  all  have  at  most  one  object,  it  could  be  argued  that 
the  results  shown  in  the  following  sections  are  not  object  recognition  but  rather 
classification  of  images  based  on  the  object  shown  in  the  image.  This  is  a  murky 
distinction.  What  makes  object  recognition  different  from  image  classification 
(since  we  can  imagine  that  an  image  that  contains  both  a  shoe  and  hammer 
would  be  correctly  classified  both  as  a  SHOE  image  and  as  a  HAMMER  image)? 
The  only  reasonable  distinction  might  be  that  an  object  recognizer  must  do  more 
than  just  know  of  the  object’s  existence  in  the  image.  It  must  be  able  to  locate  it 
in  the  image,  perhaps  even  segment  it  from  the  rest  of  the  image.  This  distinction 
is  loose  enough  that  we  feel  free  to  label  these  results  as  object  recognition.  More 
will  be  said  in  the  discussion  section  about  finding  the  object  in  the  image. 

The  following  sections  shows  three  different  experiments.^  In  each  of  the  three 
experiments  the  general  methodology  was  the  same.  The  training  set  consisted  of 
14  images  from  each  class  for  a  total  of  98  images.  The  testing  set  consisted  of  14 
different  images  from  each  class  for  a  total  of  98  images.^  The  environment  was 
initialized  with  2100  random  programs,  100  random  library  functions,  and  100 
random  SMART  operators.  The  environment  was  then  run  for  50  generations, 
examining  the  fitness  of  each  program  relative  to  the  98  training  images.  For  the 
object  recognition,  the  best  seven  programs  as  determined  by  their  performance  on 
the  training  set,  were  extracted  and  orchestrated  for  testing  as  a  complete  PADO 
recognizer.  Each  experiment  was  done  five  times  in  order  to  obtain  reliable  results. 

7.  Experimental  Results  and  Analysis 

Through  this  section  we  show  how  PADO  actually  accomplishes  difficult  tasks  in 
visual  object  recognition.  The  results  shown  here  are  not  the  only  hurdles  that 
PADO  has  cleared.  In  fact,  these  results  are  not  PADO’s  most  flattering.  But 
they  give  an  accurate,  easy  to  understand  picture  of  the  sort  of  performance  that 
PADO  can  currently  deliver  on  real  tasks.  The  section  is  organized  in  three  parts. 

The  experiment  on  object  recognition  in  Section  7.3  uses  the  results  of  the  discrimination 
experiment  done  in  Section  7.1. 

^In  order  to  get  a  sufficiently  large  number  of  different  images  for  set  A,  each  “original  image” 
was  subjected  to  a  variety  of  transformations  (e.g.  mirror  image,  contrast  up,  etc.)  to  produce 
several  different  images. 
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Each  subsection  will  explain  an  experiment,  detail  the  results,  and  then  give  some 
basic  analysis  and  implications. 

7.1.  The  Standard  PADO  Technique 

At  the  end  of  each  generation,  each  program  is  placed  in  the  group  K  that  maxi¬ 
mizes  its  reward.  That  reward,  as  mentioned  before,  is  as  follows  for  an  individual 
in  group  K. 


Discrimination  Reward  Percentage 
60  I - \ - 1 - 1 - 1 - 1 - 1 - 1  I  r 


0  5  10  15  20  25  30  35  40  45  50 

Generations 


Figure  2:  PADO  discrimination  reward  percentage  on  test  images  of  Set  A. 

/*  R  is  the  rew^ard  and  C  is  the  number  of  classes.  Guess[t/][j]  is  the 
confidence  program  U  returned  for  image  j.  Object  Class  [j]  is  the  object 
type  that  appears  in  image  j.  */ 
int  Reward(class  K,  int  Guess[  ]) 

R  =  0; 

Loop  j  =  1  to  MaxResponses 

If  {K  =  ObjectClass[j])  Then 
R  =  R  +  ((C  -  1)  +  Guess[U][j]); 

Else 

R  =  R  -  Guess[U][j]', 

return  R; 

7.1.1.  Results 

Figures  2  and  3  show  the  average  reward  of  the  top  <S  programs  in  each  class 
for  each  generation  between  1  and  50  based  on  98  test  images.  In  other  words, 
the  S  programs  from  class  X  that  had  the  highest  reward  for  the  training  images 
are  selected.  Then  each  of  these  S  programs  at  each  generation  is  reevaluated  on 
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Discrimination  Reward  Percentaqe 
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Figure  3:  PADO  discrimination  reward  percentage  on  test  images  of  Set  T. 


98  test  images  and  is  given  a  reward  based  on  how  it  performed  on  these  98  test 
images.  Then  these  S  rewards  are  averaged  to  give  the  score  for  class  J  that  is 
plotted  on  the  graph.  5  is  a  PADO  parameter  which  for  this  report  was  set  to  be 
seven. 

This  data  was  taken  on  five  separate  runs  and  the  graph  is  the  average  over 
these  five  runs.  The  range  of  possible  rewards  is  from  -100%  to  100%.  Random 
guessing  would  result  (on  average)  in  a  reward  of  0.  So  a  reward  of  positive  50% 
is  actually  75%  of  the  distance  from  the  lowest  possible  reward  to  the  highest 
possible  reward. 

Learning  (i.e.  improvement)  continues  after  generation  50,  but  the  learning 
rate  continues  to  diminish.  In  order  to  do  the  number  of  experiments  necessary 
for  reliable  data,  most  of  them  were  not  allowed  to  continue  far  past  generation 
50.® 

7.1.2.  Analysis 

Figures  2  and  3  show  the  increasing  abibty  of  the  best  programs  in  the  population 
to  distinguish  their  class  from  the  others.  The  most  obvious  “feature”  of  the 
graphs  is  that  none  of  the  curves  in  either  graphs  gets  much  above  50%  positive 
reward.  Does  this  mean  that  even  at  generation  50  the  most  fit  programs  are 
still  having  a  hard  time  distinguishing  their  class  from  the  others?  Basically,  the 
answer  is  no.  Remember  first  that  random  choice  would  average  a  reward  of  0. 
Because  the  numbers  that  these  programs  return  are  confidences,  it  is  possible  to 
be  “right”  about  a  picture  without  getting  the  maximum  possible  reward.  For 

®  Generation  50  was  picked  as  a  cut  off  point  in  part  because  it  is  a  traditional  benchmark 
generation  number  in  the  field  of  evolutionary  computation. 
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example,  if  a  program  which  is  later  designated  as  a  Thermos  program  returns  a 
confidence  of  0  upon  seeing  an  image  in  which  there  is  no  thermos,  its  reward  will 
be  maximal  (100%).  It  it  returns  a  confidence  of  255  then  its  reward  is  minimal 
(-100%).  If,  however,  it  returns  a  confidence  of  10,  for  example,  then  its  reward 
will  be  92%  which  is  close  to  the  maximum  reward  it  could  obtain  on  this  picture. 
So  the  fact  that  these  curves  do  not  climb  to  near  100%  reward  has  two  reasons. 
The  first  is  that  the  problem  is  very  hard  and  the  training  set  size  is  small.  So  no 
program  can  perfectly  fit  the  data.  The  second  reason  is  more  interesting.  There 
are  some  pictures  which  are  more  clearly  from  class  1  than  others.  So  it  makes 
sense  that  many  of  the  programs  trained  to  discriminate  images  with  objects  from 
class  I  from  other  images  should  learn  to  express  real  levels  of  confidence  based 
on  how  likely  they  think  it  is  that  the  picture  is  what  they  believe  it  to  be. 

A  second  important  facet  of  the  graphs  in  Figures  2  and  3  is  that  none  of 
the  curves  in  either  graph  is  drastically  lower  than  the  rest  of  the  curves.  While 
they  all  have  slightly  different  learning  curves  and  some  seem,  on  average,  to  be 
a  little  easier  to  distinguish,  they  all  keep  pace  with  the  pack  as  the  generations 
go  by.  Among  other  things,  this  is  an  excellent  indication  that  the  images  are 
all  about  the  same  level  of  difficulty  and  that,  because  the  graphs  grow  slowly  in 
their  performance,  this  difficulty  level  is  non-trivial. 


7.2.  Incremental  PADO 

This  subsection  details  a  similar  set  of  experiments.  In  these  experiments  only 
two  classes  were  trained  in  the  early  generations.  The  other  classes  were  added 
incrementally  in  periods  of  four  generations,  starting  at  generation  twelve. 


7.2.1.  Results 

Figures  4  and  5  show  the  results  obtained  in  reward  average.  Notice  that  these 
newly  added  class  discriminators  rise  very  quickly  to  the  point  that  they  would 
have  been  had  they  been  trained  from  the  beginning. 

At  generation  25,  the  incremental  learning  technique  has  used  about  half  the 
computation  time  that  the  standard  PADO  learning  took  for  all  seven  classes.  At 
generation  50,  the  incremental  learning  technique  has  used  about  75%  of  the  cycles 
that  the  standard  method  uses.  Also  notice  that  on  average  the  incrementally 
learned  classes  do  slightly  better  than  the  standard  technique. 

We  tried  different  training  class  orderings  and  obtained  similar  results.  This 
suggests  that  there  is  nothing  special  about  the  order  in  which  the  incremental 
classes  are  introduced  or  the  period  between  introductions.  The  only  exception 
is  that,  if  the  rate  of  introduction  of  new  classes  is  faster  than  one  every  four 
generations,  the  learning  curve  is  not  quite  as  steep.  This  is  probably  because 
there  is  a  period  of  a  few  generations  during  which  two  classes  are  both  trying  to 
get  up  to  speed.  Waiting  longer  between  introductions  produces  a  very  similar 
rise  in  performance.  In  short,  any  graph  must  show  this  incremental  learning 
process  using  a  particular  introduction  rate  and  a  particular  sequence  for  class 
introduction.  These  graphs  are,  however,  representative  of  graphs  with  various 
orderings  on  the  classes  and  various  speeds  of  introduction. 
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Figure  4:  Incremental  PADO  reward  percentage  on  test  images  of  Set  A. 
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Figure  5:  Incremental  PADO  reward  percentage  on  test  images  of  Set  T. 

7.2.2.  Analysis 

The  incremental  learning  graphs  in  Figures  4  and  5  show  similar  properties  to 
those  discussed  in  Figures  2  and  3  and  the  arguments  continue  to  be  valid.  The 
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difference  in  these  graphs  is  that  two  of  the  classes  are  trained  from  generation 
0.  Then  starting  at  generation  12,  a  new  class  and  new  training  images  are 
added  at  every  fourth  generation.  Unlike  the  graphs  from  the  standard  PADO 
technique,  these  graphs  do  show  dramatic  jumps  in  performance  over  a  small 
number  of  generations.  These  jumps  are  not  constant,  but  increase  in  steepness 
as  the  number  of  its  introduction  increases.  So  it  seems  that  a  class  added  later 
learns  up  to  the  level  of  its  fellow  classes  “faster”  than  previously  added  classes 
did. 

So  why  is  it  that  these  curves  of  the  added  classes  seem  to  jump  up  so  quickly 
to  meet  the  rest  of  the  curves  and  then  continue  on  as  though  it  had  been  trained 
from  the  beginning?  It  would  be  almost  impossible  to  give  a  definitive  answer  to 
this  question.  Instead,  consider  the  following  as  a  plausible  explanation.  At  some 
generation  Q  we  choose  to  add  in  another  class.  This  means  that  the  number 
of  groups  the  population  will  be  divided  into  at  the  end  of  generation  Q  will 
increase  by  one  and  the  training  set  size  will  increase  so  that  there  is  an  equal 
number  of  each  image  class,  including  now  the  new  image  class.  This  new  group 
will  be  formed  from  the  individuals  in  the  population  that  performed  best  as 
discriminators  of  this  new  class.  Even  at  the  end  of  generation  Q  we  can  expect 
some  non-negligible  performance.  This  new  object  must  be  “most  like”  one  of 
the  other  objects.  So  one  of  the  programs  from  this  old  class  that  is  most  similar 
will  serve  better  as  a  discriminator  than  one  we  would  pick  at  random.  For  a  few 
generations  afterwards  this  effect  is  still  important,  but  does  not  seem  sufficient 
to  explain  these  sizable  jumps  in  reward.  The  second  half  of  this  explanation  lies 
in  the  mechanism  of  the  libraries.  Rather  than  delve  into  them  here  we  will  finish 
this  discussion  in  the  natural  course  of  the  discussion  below  about  the  libraries. 
What  the  results  of  the  incremental  experiments  tell  us  is  that  we  can  quickly  get 
a  new  class  up  to  comparable  performance  with  the  current  classes.  This  may 
mean  that  larger  numbers  of  classes  can  stiU  be  learned  tractably  or  even  cheaply 
by  leveraging  off  existing  knowledge  through  this  incremental  learning  technique. 

7.3.  Object  Recognition  with  PADO 

As  was  outlined  in  Section  5,  object  recognition  for  C  classes  is  accomplished 
in  PADO  by  the  orchestration  of  C  different  systems.  Each  of  these  systems 
is  composed  of  the  S  most  fit  programs  from  the  corresponding  group  of  the 
current  generation.  In  order  to  show  object  recognition  from  generation  0  it  is 
necessary  to  learn  all  C  program  classes  from  the  beginning.  The  incremental 
technique  actually  seems  (on  average)  to  produce  slightly  better  programs,  but 
until  generation  24  or  28  respectively  some  classes  have  no  programs  and  so  the 
object  recognition  could  only  proceed  on  a  set  of  test  images  containing  objects 
from  the  classes  already  learned.  This  would  make  the  graphs  so  hard  to  decipher 
that  instead  we  simply  picked  programs  from  the  standard  PADO  strategy. 

Systemj  is  built  from  the  S  programs  that  best^  learned  to  recognize  an 
object  from  class  I.  The  S  responses  that  the  S  programs  return  on  seeing  a 
particular  image  are  aU  weighted  equally  and  their  weighted  average  of  responses 
is  interpreted  as  the  confidence  that  Systemj  has  that  the  image  in  question 


’^Based  on  the  training  results  from  that  generation. 
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Object  Recognition  Percentage  Correct 


Figure  6:  PADO  object  recognition  percentage  correct  on  Set  A  test  images. 


Object  Recognition  Percentage  Correct 


Figure  7:  PADO  object  recognition  percentage  correct  on  Set  T  test  images. 


contains  an  object  from  class  J.  PADO  does  object  recognition  by  orchestrating 
the  responses  of  the  C  systems.  The  confidence  response  of  each  system  is  initially 
weighted  equally  and  the  maximum  confidence  wins.  That  is,  if  System^  has  the 
highest  weighted  confidence  then  T  is  selected  as  the  class  of  the  image  object. 
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Figure  8:  The  weights  W  and  V  that  are  trained  early  in  testing. 


7.3.1.  Results 

In  these  experiments  (Figures  6  and  7)  we  followed  an  orchestration  method  where 
there  was  some  learning  early  in  the  test  phase.  During  the  first  few  test  images 
the  weights  are  adjusted  by  telling  PADO  after  its  guess  whether  it  was  right  or 
wrong.  Specifically,  each  program  JT”  in  a  particular  Systemx  has  its  weight  Vy 
adjusted  after  each  of  these  initial  tests  so  that  its  weight  increases  when  it  returns 
a  confidence  near  the  correct  confidence  and  decreases  if  its  returned  confidence  is 
far  from  the  correct  confidence.  Similarly,  Systemx  has  its  weight  FVj  adjusted  in 
the  same  manner.  This  strategy  (shown  in  Figure  8)  is  only  one  of  many  ways  that 
the  orchestration  could  be  accomplished.  Several  other  orchestration  strategies 
were  also  tried  with  similar  success.  This  orchestration  strategy  was  chosen  to 
obtain  these  results  because  it  works  well  and  is  simple  to  explain.  This  extra 
learning  (orchestration)  adds  only  a  few  milliseconds  to  the  total  testing  time. 

We  considered  throwing  out  these  first  few  test  image  results  since  there  was 
learning  (orchestration)  going  on  during  this  period.  But  since  these  first  few  test 
images  are  the  ones  PADO  does  worst  on  (since  it  is  stiU  orchestrating)  our  results 
improve  by  taking  them  out.  So  it  seemed  that  the  most  reasonable  thing  to  do 
would  be  to  count  the  results  of  all  the  test  image  guesses. 

The  data  for  Figures  6  and  7  were  averaged  over  five  runs.  On  each  run, 
on  each  generation,  a  PADO  program  was  compiled  from  C  systems  which  were 
each  built  from  the  best  programs  available  during  that  generation.  This  PADO 
program  was  then  tested  on  its  ability  to  correctly  recognize  which  object  was  in 
each  of  the  98  test  images. 
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7.3.2.  Analysis 

Figures  6  and  7  show  the  ability  of  PADO  to  do  object  recognition  on  two  sets  of 
image  data.  The  crux  of  the  paper  and  the  important  question  is  “Does  PADO 
succeed  at  the  task  of  object  recognition?  Is  PADO  worth  aU  this  trouble?”  The 
most  important  thing  to  point  out  is  that  if  we  constructed  a  simple  system  that 
simply  guessed  at  the  class  of  the  image  by  choosing  a  class  at  random,  it  would  be 
right  about  14%  of  the  time  (shown  as  a  dotted  hne  in  Figures  6  and  7).  PADO 
passes  this  level  of  performance  at  generation  3.  At  generation  50  the  percent 
of  the  time  that  PADO  correctly  identifies  the  image  class  is  about  3.5  times 
random  achievement.  On  images  as  unconstrained  as  these  images  are,  of  objects 
as  unfriendly  as  these  objects  are,  this  is  a  real  difference.  Issues  of  scalability 
and  potential  appUcation  for  PADO  will  be  discussed  in  the  next  sections. 

There  are  two  other  items  of  note  about  these  two  object  recognition  graphs. 
The  first  is  that  they  are  both  graphs  of  the  single  class  chosen  by  the  orches¬ 
tration.  Though  it  is  not  shown  on  the  graphs,  data  was  also  taken  about  the 
percent  of  the  time  that  the  correct  class  was  “second  place”  in  the  orchestration’s 
ranking  of  the  C  confidences.  For  both  data  sets  this  percentage  was  in  the  middle 
30’s.  This  means  that  between  70%  and  80%  of  the  time  the  correct  class  was  in 
the  top  two  in  the  orchestration’s  ranking.  This  is  interesting  as  a  symptom  of 
how  PADO  fails  when  it  does  fail,  and  highlights  how,  as  the  number  of  classes 
to  choose  from  increases,  the  easier  it  is  to  get  one  “bad”  vote  that  disrupts  the 
orchestration  for  that  image. 

The  second  item  is  the  relative  heights  of  the  graphs  in  Figures  6  and  7.  On 
average  the  image  set  T  (without  the  hand  in  it)  is  easier  to  generalize  to  than  is 
set  A  (with  the  hand  in  it).  The  two  data  sets  are  difficult  for  different  reasons. 
Set  A  includes  full  rotation  and  translation  and  includes  the  hand  and  arm  that 
possibly  occlude  part  of  the  object  in  question.  However,  all  the  objects  in  set 
A  are  more  or  less  parallel  to  the  image  plane.  For  Set  T  this  is  not  the  case. 
The  hand  is  absent  but  the  objects  are  rotated  and  translated  and  foreshortened. 
Because  the  hand  and  arm  introduce  a  significant  amount  of  noise  these  results 
are  not  surprising,  but  the  whole  issue  of  assessing  the  “difficulty”  level  in  dividing 
some  set  of  signals  into  classes  is  a  very  open  question. 

8.  Discussion 

The  previous  section  gave  enough  information  for  the  reader  to  read  each  graph 
and  extract  the  data  that  those  graphs  represent.  Along  with  each  set  of  results, 
parts  of  the  previous  section  were  dedicated  to  the  interpretation  of  that  data 
and  its  ramifications.  To  really  understand  these  results  and  their  ramifications, 
however,  a  more  complete  explanation  of  the  process  is  required.  This  section,  in 
a  question  and  answer  format,  will  try  to  explore  some  of  the  puzzling  issues  of 
PADO  and  the  task  of  object  recognition. 

8.1.  The  language.  The  Architecture,  and  The  Procedure 
Why  not  use  more  “helpful”  image  related  language  primitives? 

In  the  language  described  in  this  report,  there  are  six  ways  to  get  basic  inten¬ 
sity  information  from  the  input  image(see  Section  5.1).  The  PADO  mechanism  is 
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independent  of  the  details  of  the  language  used.  For  example,  (DIV  X  0)  could 
be  redefined  to  be  0  (instead  of  255)  and  if  we  redid  all  these  tests  we  would  get 
very  similar  results.  There  are  certainly  other  language  primitives  that  could  have 
been  put  into  the  language  which  might  have  improved  the  results  shown  here. 
An  obvious  example  is  (SPATIAL-DIFFERENTIAL  X  Y  U  V)  which  would  re¬ 
turn  the  spatial  differential  along  the  line  defined  by  points  X,Y  and  U,V.  A  more 
extreme  example  would  be  to  make  the  results  of  an  edge  segmenter  and  edge 
joiner  available  through  some  language  primitives.  There  are  two  reasons  why  we 
did  not  do  this.  The  first  is  that  in  the  spirit  of  trying  a  non  traditional  approach 
to  computer  vision,  it  seemed  worthwhile  to  see  what  level  of  success  we  could 
achieve  without  borrowing  any  notions  or  structures  from  traditional  computer 
vision.  This  was  originally  motivated  by  the  lack  of  success  that  computer  vision 
has  had  with  natural  object  in  natural  scenes.  In  fact,  because  we  believe  these 
results  to  be  non  trivial,  that  reason  seems  to  have  been  justified.  The  second 
reason  is  that  we  are  trying  to  create  a  learning  architecture  that  requires  minimal 
input  or  help  from  the  user.  The  more  time  a  programmer  must  spend  to  cus¬ 
tomize  a  language  in  order  to  get  good  results,  the  less  “autonomous”  the  system 
is.  We  have  shown  in  this  report  that  with  the  bare  minimum  of  input  from  the 
user  (i.e.  these  extremely  basic  functions  for  getting  image  intensity)  good  results 
are  still  possible.  And,  of  course,  if  still  better  results  are  sought,  this  system  can 
be  improved,  among  other  ways,  by  trying  other  language  primitives. 

Why  use  Indexed  Memory  as  the  memory  structure? 

Indexed  Memory  has  shown  itself  to  be  a  highly  successful  memory  structure 
in  the  field  of  Genetic  Programming  [14].  The  successful  programs  whose  perfor¬ 
mance  is  plotted  in  Figures  2  -  7  typically  use  from  5  to  30  of  their  memory  spots 
very  heavily  and  ignore  or  largely  ignore  the  rest  of  the  256  memory  elements. 
The  size  of  the  memory  was  chosen  to  be  256  elements  simply  because  then  every 
legal  value  would  be  a  legal  pointer  into  memory  and  because  every  memory  ele¬ 
ment  could  be  accessed  through  a  pointer  from  a  legal  value.  It  would  have  been 
possible  to  use  other  memory  sizes,  but  the  results  shown  above  were  not  found 
to  be  sensitive  to  changes  as  long  as  there  were  at  least  approximately  50  memory 
slots.  It  is  possible  that  a  different  memory  size  would  be  needed  for  a  different 
problem,  but  256  elements  of  8  bits  each  provide  for  approximately  3.2  +  10®^® 
different  states  already. 

What  part  do  the  Libraries  play  in  PADO’s  functionality? 

Initially  all  100  Library  functions  are  initialized  to  be  random  legal  subtrees 
with  the  same  characteristics  as  ADFs.  At  the  end  of  every  generation,  the  K 
worst  Library  functions  are  removed  from  the  Library  and  replaced  with  the  ADFs 
of  the  K  most  successful  programs  of  the  generation.  The  “goodness”  of  a  Library 
function  is  the  sum  of  the  adjusted  fitnesses  of  aU  programs  that  called  it,  multi¬ 
plied  by  how  often  they  called  it  (with  a  ceiling  of  once  per  fitness  case).  The  ad¬ 
justed  fitness  of  each  program  is  Rank[Program{I)\  —  {MaxRank  —  MinRank)f2. 
Notice  that,  unlike  the  main  loops  and  ADFs,  the  Libraries  do  not  evolve.  Rather 
they  are  a  storage  place  for  some  of  the  best  “ideas”  in  the  population  and  bad 
ideas  are  moved  out  in  favor  of  other  ideas  that  have  a  good  chance  of  being  good. 
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How  bad  library  functions  are  determined,  how  many  are  removed,  and  how  new 
ones  are  found  or  created,  is  still  an  active  area  of  our  research. 

What  do  the  SMART  Operators  do  that  traditional  crossover  can’t? 

When  evolving  functions  or  programs,  there  are  several  issues  related  to  evolv- 
ability.  The  three  most  important  are:  language  representation,  genetic  change 
paradigm,  and  genetic  change  operator(s).  In  Genetic  Programming  the  language 
representation  is  usually  a  Lisp-like  structure  much  hke  the  language  shown  in 
Section  5.  The  major  difference  we  have  introduced  is  that  we  are  now  evolv¬ 
ing  programs  instead  of  functions.  In  Genetic  Programming  the  main  genetic 
change  paradigm  is  crossover.  In  standard  crossover,  two  nodes  are  picked,  one 
from  each  expression-tree.  Then  these  nodes  (and  the  subtrees  under  them)  are 
exchanged.  PADO  maintains  this  basic  paradigm.  In  Genetic  Programming  the 
genetic  change  operator  is  random  (i.e.  the  two  nodes  to  be  exchanged  are  cho¬ 
sen  at  random).  This  does  not  work  for  PADO.  The  space  of  algorithms  is  much 
more  difficult  to  negotiate  than  is  the  space  of  functions  [15]  and  the  result  is 
that,  for  PADO,  traditional  crossover  stops  helping  at  relatively  low  fitness  levels. 
In  PADO  we  have  developed  SMART  operators  to  help  us  choose  which  nodes  to 
exchange,  where  these  two  nodes  are  either  both  in  main  loops  of  two  programs 
or  both  in  ADFs  of  two  programs.  A  SMART  operator  is  a  program  that  takes 
two  expression-trees  as  input  and,  after  some  deliberation,  indicates  two  nodes, 
one  in  each  tree,  that  are  to  be  exchanged.  These  SMART  operators  evolve  in  a 
separate  pool,  but  at  the  same  time  as  the  main  population,  so  that  how  they  act 
changes  with  the  changing  needs  of  the  main  population.  This  new  approach  to 
crossover  operators  has  been  much  more  successful  and  to  it  we  attribute  much 
of  PADO’s  success.  Unfortunately,  the  details  of  the  SMART  operators  might  fiU 
a  report  by  itself  and  are  the  subject  of  another  publication  [17]. 

Why  does  PADO  evolve  programs  instead  of  functions? 

As  was  just  mentioned,  evolution  becomes  more  difficult  (or  at  least  requires 
smarter  recombination)  when  programs  replace  functions  as  the  type  of  thing  in 
the  population  (see  Section  4).  If  there  was  no  benefit  to  using  programs  over 
using  functions,  we  and  PADO  could  avoid  a  lot  of  work.  However,  it  turns  out 
that  the  results  shown  in  this  report  could  not  be  obtained  in  a  similar  fashion 
when  functions  were  used  instead  of  programs.  In  order  to  achieve  80%  of  this 
report’s  results  using  functions,  PADO  requires  10  times  that  space  (memory)  and 
almost  5  times  as  many  hours  of  computation.  And  in  the  experiments  we  did,  the 
results  never  climbed  to  the  near  50%  object  recognition  rate  shown  in  Section  7.3. 

How  does  PADO  ensure  that  every  program  stops  after  a  fixed  time? 

As  was  mentioned  in  Section  5,  the  PADO  language  is  Turing  complete.  If 
PADO  waited  until  every  program  was  “finished”  where  finished  was  defined  by 
some  returned  value  or  state  of  its  memory,  it  is  hkely  that  PADO  would  never  get 
past  generation  0,  because  many  of  the  programs  generated  randomly  would  run 
on  forever.  PADO  avoids  this  problem  by  constructing  each  program  as  an  “any¬ 
time  algorithm.”  This  means  that  the  program  can  run  for  as  long  as  it  wants, 
but  at  any  time  PADO  may  interpret  what  it  has  done  so  far  as  its  “answer.”  The 
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main  loop  of  each  program  returns  a  value  after  each  execution.  Most  programs 
execute  their  main  loop  between  10  and  100  times  during  the  first  65  milhseconds 
of  program  execution.  So  PADO  stops  each  program  after  65  milliseconds  and 
interprets  this  series  of  values  that  the  program  has  returned  (one  for  each  main 
loop  completion)  as  its  answer.  Since  the  program  must  return  a  single  value, 
this  series  of  values  is  averaged.  And  because  it  is  likely  that  the  values  near  the 
end  of  its  execution  are  more  “informed”  about  the  correct  answer  (having  had 
more  time  to  “think”)  the  series  of  values  is  Unearly  weighted  so  that  the  value 
returned  at  time  T  is  weighted  by  T.  This  is  just  one  way  around  the  problem 
of  waiting  for  all  programs  to  halt.  There  are  other  ways  to  constrain  the  con¬ 
struction  of  programs  so  that  they  all  halt  in  a  bounded  amount  of  time.  None 
of  these  techniques  were  investigated.  There  are  also  other  ways  of  enforcing  the 
anytime  solution  that  PADO  uses.  For  example,  we  tried  running  each  program 
for  65  milliseconds  and  then  interpreting  the  answer  in  Memory[0]  as  the  answer. 
This  also  worked  weU  and  we  continue  experiments  using  this  strategy.  We  finish 
by  remarking  only  that  PADO’s  architecture  only  requires  that  it  get  a  fitness  for 
each  program  in  a  bounded  amount  of  time.  How  we  do  this  here  is  an  implemen¬ 
tation  detail. 

Why  was  each  program  only  given  65  milliseconds  to  run? 

If  we  had  given  them  1  second  each  instead  of  l/15s,  learning  would  have 
taken  15  times  as  long.  So  we  don’t  have  good  information  on  how  much  better 
these  programs  could  do  if  given  such  a  long  stretch  to  think  about  a  single  im¬ 
age.  Some  tests  were  done  for  time  thresholds  as  long  as  250  milliseconds.  There 
was  some  increase  in  the  peak  performance  for  each  generation,  but  taking  into 
account  the  longer  time  to  run  each  generation,  the  rate  of  increase,  in  computer 
cycles,  of  the  peak  performances  for  each  generation  was  higher  with  rates  closer 
to  50  milliseconds.  Since  machines  change  and  as  the  environment  improves,  this 
1/15  of  second  may  become  more  like  1/30  of  a  second.  It  is  more  relevant,  then, 
to  talk  about  how  much  work  gets  done  in  the  time  allotted.  At  1/15  of  a  second 
each  program  is  able  to  evaluate  between  1000  and  8000  nodes.  These  numbers 
are,  of  course,  machine  and  implementation  dependent.  Remember,  some  nodes, 
like  the  terminals  and  the  simple  non-terminals  (e.g.  ADD),  evaluate  quickly. 
But  some  of  the  nodes  take  a  very  long  time  to  evaluate  (e.g.  (VARIANCE  0 
0  255  255)  takes  about  0.5  milliseconds).  Since  each  program  is  given  a  small, 
fixed  amount  of  time  to  run,  it  must  decide  how  to  design  its  code  to  balance 
this  difference  in  the  time  cost  of  evaluating  various  nodes.  So  the  answer  is  that 
1/15  of  a  second  was  chosen  as  a  number  which  balanced  well  the  criteria  just 
mentioned.  This  choice  means  that  the  object  recognizer  will  take  C  =)=  5  +  1/15 
seconds  to  do  the  object  recognition.  For  the  results  described  in  Section  7  this  is 
about  3  seconds.  Another  point  worth  mentioning  is  that  these  programs,  while 
very  Lisp-like,  are  being  interpreted  each  time  they  run.  If  you  had  C  programs 
which  made  up  some  PADO  object  recognizer,  they  could  be  compiled  into  LISP, 
C,  Pascal,  etc.  and  then  from  there  compiled  into  assembly  and  run.  This  would 
probably  yield  a  speed  increase  of  between  5  and  20  fold. 
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Where  is  the  “Parallel”  in  PADO? 

Above,  we  mentioned  that  PADO  {on  C  =  7  classes)  took  about  3.2  seconds 
to  make  a  prediction.  Again,  this  number  is  machine  and  implementation  depen¬ 
dent.  This  speed  is  a  little  slow  for  a  reactive  agent,  but  for  computer  vision  in 
general  this  speed  is  reasonable.  If  we  were  to  scale  up  to  C  =  100  classes  and  kept 
S  =  7  programs  per  system,  it  would  now  take  about  700  *  (1/15)  =  46  seconds 
to  recognize  one  image.  If  we  compiled  the  programs  as  mentioned  above,  this 
would  probably  drop  to  3  or  4  seconds,  but  this  speed  is  still  a  httle  slow.  Which 
brings  us  to  the  P  in  PADO:  Parallel.  Unlike  most  learned  systems,  the  complex 
job  that  PADO  does  can  be  easily  parallelized.  If  there  are  100  classes  and  7 
programs  have  been  selected  from  each  of  the  100  groups,  there  are  700  programs 
to  run  and  if  even  50  processors  were  available,  the  time  to  find  an  answer  could 
be  cut  by  a  factor  of  50.  Because  aU  the  programs  take  exactly  the  same  amount 
of  time  to  run  the  speed  up  from  the  parallelism  will  be  exactly  hnear.  So  a  robot 
that  had  even  4  processors  on  board  could  recognize  100  classes  in  less  than  a 
second,  using  the  current  compiled  PADO  technology. 

Why  was  the  Orchestration  a  weight  vector  tuning? 

Orchestration  is  a  crucial  part  of  the  PADO  architecture.  The  implementa¬ 
tion  of  this  orchestration  is  less  important  partly  because  there  are  several  good 
solutions.  For  example,  an  alternative  method  for  orchestration  using  PADO  it¬ 
self  was  tested.  That  is,  each  orchestration  procedure  was  a  program  developed 
in  a  separate  PADO  run  in  which  the  inputs  were  the  outputs  of  S  different  ob¬ 
ject  discrimination  programs  and  the  output  was  a  confidence.  Because  this  is 
a  program  it  is  easy  to  see  that  this  could  learn  to  return  I  where  program  J 
gave  the  largest  input  to  this  orchestration  program.  This  orchestration  program 
could,  in  fact,  implement  the  weight  vector  tuning  in  its  memory  or  do  something 
even  more  complicated.  Because  this  implementation  of  orchestration  required  a 
second,  smaller  training  set  and  took  more  time,  it  was  not  used  as  the  example 
for  this  report,  but  its  results  on  object  recognition  were  comparable  to  those 
shown  in  Section  7.3.  An  additional  piece  of  appeal  for  the  weight  vector  tuning 
is  that  because  it  is  so  fast,  there  is  really  no  reason  not  to  leave  it  on  aU  the 
time  during  its  “testing”  phase.  During  its  life  span  of  usefulness,  conditions  may 
change  or  even  be  periodic.  This  continual  orchestration  tunning  would  allow  the 
learned  system  to  quickly  turn  over  the  decision  making  to  those  programs  most 
suited  for  the  current  conditions.  Paradigmatically,  this  continual  adjustment  also 
seems  like  a  reasonable  way  to  test  the  performance  of  a  system.  Off-line  PADO 
can  do  a  large  amount  of  computation  for  learning.  During  testing,  however,  it 
should  be  the  case  that  the  system  gets  feedback  about  how  it  is  doing.  Any  addi¬ 
tional  learning  it  can  do  in  real  time  should  not  only  be  allowed,  but  encouraged. 
Orchestration  can  take  place  as  part  of  the  training  phase,  but  we  use  the  word 
orchestration  rather  than  learning  partly  because  it  is  the  activity  of  orchestrating 
the  parts  that  is  important,  and  not  that  it  counts  as  “learning”  or  “testing”. 

How  much  time  did  it  take  to  get  these  results? 

As  has  been  discussed,  the  orchestration  takes  milliseconds,  so  the  entire  cost 
to  develop  a  PADO  recognizer  is  the  time  it  takes  to  evolve  the  programs  to  be 
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used  in  that  PADO  recognizer.  It  takes  approximately  48  hours  of  CPU  time  on 
a  DECstation5000/20  to  train  7  classes  up  to  50  generations.  Since  this  data  was 
taken,  improvements  have  been  made  to  the  learning  environment  which  would 
have  cut  this  time  down  to  24  hours  of  CPU  time.  One  incremental  learning 
run  took  about  37  hours  of  CPU  time  and  would,  with  the  new  environment, 
take  about  18  hours  of  CPU  time.  The  environment  is  written  in  C.  Further 
improvements  and  exploration  of  alternative  techniques  are  being  explored  in  our 
current  research. 

8.2.  Related  Issues 

Why  use  more  than  one  program  to  decide  on  a  confidence  for  class  J? 

It  seems,  at  first,  that  if  the  “best”  program  is  the  best,  then  trusting  the 
response  of  that  program  is  the  best  PADO  will  be  able  to  do.  The  first  reason 
this  approach  would  not  be  optimal  is  that  the  top  few  programs  are  picked  as 
determined  by  the  training  phase.  There  is  no  guarantee  that  the  individual  that 
best  fit  the  training  data  will  also  generalize  the  best.  So  that  is  a  good  reason 
for  taking  a  few  of  the  top  programs  from  each  group  I.  If  the  information  ex¬ 
tracted  by  each  program  from  the  image  were  very  similar  (i.e.  if  their  responses 
on  specific  pictures  were  highly  correlated),  then  that  would  be  the  only  good 
reason.  However,  this  correlation  is  not  the  case.  It  turns  out  the  errors  that 
the  best  program  in  group  I  makes  on  test  images  is  largely  independent  of  the 
errors  the  second  best  program  in  group  I  makes  on  test  images.  And  similarly 
for  the  second  best  relative  to  the  third,  etc.  This  means  (approximately)  that  we 
can  reduce  our  error  by  polling  several  of  these  programs.  The  chance  that  the 
majority  of  them  are  wrong  goes  down  as  the  number  of  them  increases.  Because 
the  top  few  programs  are  much  more  fit  (able  to  discriminate  correctly)  than,  for 
example,  the  average  program  from  that  group,  there  would  be  a  disadvantage  to 
taking  too  many  to  use  in  the  PADO  object  recognition.  This  is  why  the  PADO 
object  recognizer  does  not  use  programs  to  do  predictions.  PADO  takes  S 
programs  from  each  group,  where  >5  is  a  constant.  So  PADO  grows  Unearly  in 
time  and  space  with  increasing  group  size. 

How  do  we  know  that  these  images  aren’t  easy  to  distinguish? 

Figure  9  shows  the  average  intensity  for  each  of  21  randomly  selected  images 
from  each  class  from  each  set  (273  images  total).  Clearly,  this  piece  of  global 
information  is  not  enough  to  partition  the  images  into  the  correct  classes.  That 
fact,  of  course,  does  not  rule  out  the  existence  of  some  global  property  of  the  image 
on  which  partitioning  can  be  done  successfully.  Since  the  four  non-local  image 
primitives  these  programs  had  were  Average,  Min,  Max,  and  Variance,  all  four  of 
these  were  tested  as  global  properties.  The  graphs  of  all  four  for  both  data  types 
look  very  much  like  the  two  shown  above.  The  large  amount  of  variance  between 
pictures  in  the  same  class  for  all  four  of  these  tests  suggests  that  there  is  no  real 
predictive  power  in  these  values.  So  any  successful  technique  for  recognizing  the 
objects  in  these  images  must  be  based  in  part  on  the  ability  to  focus  attention. 
Beyond  the  foviation  that  we  have  just  concluded  the  PADO  recognizer  is  doing, 
it  becomes  very  difficult  to  say  how  hard  two  images  are  to  distinguish  from 
each  other.  Also,  because  of  the  complexity  and  density  of  these  programs,  it  is 
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Figure  9:  Average  image  intensity  for  random  images  from  each  class. 


very  hard  to  ascertain  whether,  for  example  a  particular  successful  program  is 
looking  for  light-dark  boundaries,  or  whether  it  is  looking  for  particular  shapes, 
or  textures,  or  any  other  type  of  visual  clue. 

One  detail  of  note  is  that  PADO  may  give  us  object  position  for  free  in  im¬ 
ages  with  a  single  object  each.  As  was  mentioned  in  Section  6,  some  people  may 
feel  that  object  recognition  requires  at  least  object  position  location  in  the  image 
and  maybe  even  pose  determination.  Each  image  primitive  that  PADO  executes 
(pixel,  region,  average,  least,  most,  variance)  takes  place  in  a  particular  part  of 
the  image.  A  simple  computation  is  to  find,  for  each  image  that  some  particular 
“best”  program  from  class  I  looks  at,  what  the  center  of  mass  of  aU  these  image 
primitive  locations  is.  It  turns  that  with  high  probability  this  center  of  mass  will 
be  near  the  center  of  the  object  in  the  image.  Since  the  object  typically  takes 
about  25%  of  the  image  and  moves  widely  between  images,  and  this  happens  with 
regularity,  PADO  demonstrates  its  ability  to  focus  attention  and  returns  a  free 
piece  of  information:  object  location  in  the  image! 

What  reasons  exist  for  believing  that  PADO  will  scale  up  to  100  classes? 

PADO  ranks  the  population  along  C  different  dimensions.  Sometimes  it  turns 
out  that  one  of  the  best  in  class  J  is  also  a  good  program  in  class  T,  where  T  ^  J. 
In  other  words,  it  might  turn  out  that  at  the  end  of  some  generation,  program  A 
is  the  best  at  distinguishing  bottles  from  other  objects,  but  also  happens  to  be 
pretty  good  at  distinguishing  coke  cans  from  other  objects.  This  must  be  because 
the  coke  can  and  the  sprite  bottle  have  some  visual  similarities  (e.g.  shape).  To 
scale  up  the  number  of  classes,  we  will  introduce  a  hierarchical  orchestration.  The 
difficult  part  of  PADO  will  become  the  construction  of  this  hierarchical  structure 
for  classes.  We  can  use  indications  of  similarity  fike  the  one  just  mentioned  to 
find  correlations  between  the  image  classes  and  build  the  hierarchy.  More  will  be 
said  on  this  in  the  future  work  section. 
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Will  PADO  work  with  more  similar  classes? 

The  results  shown  in  this  report  are  not  the  only  tests  PADO  has  passed.  For 
example,  set  T  originally  had  three  dimensions:  color,  saturation,  and  brightness. 
When  PADO  was  given  the  color  information,  its  performance  on  object  recogni¬ 
tion  jumped  to  about  95%.  The  reason  is  undoubtably  because  the  images  were 
“too”  different  along  some  dimension.  That  dimension  was  color.  The  objects 
were  of  such  different  colors  that  this  information  is,  by  itself,  a  very  good  indi¬ 
cator  of  the  class.  In  an  effort  to  make  the  problem  harder,  we  gave  PADO  only 
the  brightness  data  (the  greyscale  images  shown  in  Figure  1).  PADO  didn’t  do 
as  well,  but  given  that  the  problem  got  much  harder,  stiU  did  very  well.  This  will 
be  mentioned  again  in  the  future  work  section.  So  while  PADO’s  performance  on 
7  different  shoe  classes  is  in  question,  we  could  certainly  have  chosen  objects,  or 
signals  that  represent  those  objects,  that  were  much  less  similar. 

9.  Future  Work 

Scalability 

The  issue  of  scalability  is  critical  to  the  success  of  PADO.  We  are  trying  to  design 
a  learning  architecture  that  can  build  up  useful  systems  to  be  applied  to  natural 
environments.  Because  PADO  was  designed  with  an  eye  for  success  as  well  as 
scientific  advancement,  it  must  be  able  to  scale  up  to  hard  problems. 

The  most  obvious  area  of  scalability  is  in  the  number  of  classes.  This  report 
dealt  with  a  system  that  performed  well  in  an  environment  with  7  classes.  While 
there  are  real  applications  for  recognizing  5  to  10  classes,  most  applications  need 
to  be  able  to  recognize  hundreds  of  classes.  In  the  discussion  section  we  mentioned 
that  new  types  of  orchestration  are  an  active  part  of  our  current  and  future  work. 
By  doing  a  hierarchical  structure  for  the  orchestration,  we  hope  the  number  of 
classes  PADO  can  handle  will  at  least  move  into  the  hundreds. 

A  second  important  area  of  scalability  is  performance.  PADO’s  results  on  set 
A  and  set  T  were  much  better  than  random,  but  they  were  not  perfect.  Many 
applications  can  be  useful  even  if  there  is  some  error,  but  most  apphcations  require 
error  rates  of  5%  to  10%  and  some  cannot  tolerate  even  that.  So  PADO’s  abihty 
to  improve  its  performance  is  also  critical  for  PADO’s  future.  This  issue  becomes 
even  more  important  when  we  remember  that,  as  PADO  begins  to  tackle  hundreds 
of  objects  instead  of  tens  of  objects,  the  difficulty  of  the  problem  rises  very  quickly 
and  would  be  impressive  for  PADO  to  maintain  its  current  performance  on  seven 
classes  as  the  number  of  classes  rises. 

Though  less  pressing,  time  factor  is  also  a  scalability  issue  for  PADO.  As  we 
require  PADO  to  perform  much  better  on  many  more  classes,  the  number  of  gener¬ 
ations  necessary  to  achieve  this  state  will  skyrocket.  So  even  though  most  learning 
can  be  done  “off  line”  in  a  non  time-critical  way,  it  will  become  increasingly  im¬ 
portant  to  find  faster  ways  to  implement  PADO  and  better  architectural  choices 
that  require  less  space  or  time.  One  point  here  in  PADO’s  favor  is  that  evolu¬ 
tionary  computation  lends  itself  easily  to  massive  parallelization.  So  by  using  100 
processors,  we  could  divide  PADO’s  learning  time  100  fold.  An  additional  avenue 
that  we  have  already  started  to  explore  is  incremental  learning.  As  discussed  in 
the  Section  7.2  it  is  possible  to  train  several  classes,  and  then  later  add  a  new 
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class.  As  our  ability  to  get  new  classes  quickly  up  to  speed  improves,  we  may  be 
able  to  conquer  the  time  problem  this  way. 

Partly  because  they  are  so  difficult  to  understand,  it  is  hard  to  know  exactly 
what  role  the  Libraries  and  the  SMART  operators  play  in  PADO’s  performance. 
Clearly  the  future  performance  of  PADO  will  depend  heavily  on  how  well  we  can 
improve  these  features  of  the  PADO  architecture. 

Investigation  for  Greater  Understanding 

There  are  several  parts  of  the  PADO  architecture  that  are  not  yet  well  understood. 
Two  examples  just  mentioned  are  the  Libraries  and  the  SMART  operators.  In 
both  cases  we  have  good  indications  that  they  are  vital  to  the  functioning  of  the 
system,  but  we  don’t  really  understand  why.  The  work  we  are  doing  to  better  un¬ 
derstand  different  aspects  of  PADO  will  help  us  to  change  these  things  to  achieve 
some  of  the  goals  of  scalability  mentioned  above.  Another  example  of  a  PADO 
aspect  that  should  be  investigated  is  the  programs  themselves.  By  understanding 
better  how  they  accomplish  their  tasks  we  may  be  able  to  learn  about  how,  in 
general,  object  recognition  and  signal  understanding  is  done. 

Innovative  Applications  of  PADO 

PADO  was  not  designed  specifically  for  images.  It  was  designed  to  be  able  to 
do  classification  on  any  set  of  signals.  So  far  PADO  has  only  been  applied  to 
images.  One  of  the  most  important  next  steps  in  PADO’s  growth  will  be  a  series 
of  classification  tests  on  a  variety  of  signal  types,  all  of  which  were  not  designed 
with  PADO  in  mind.  These  signal  types  wiU  include  speech,  sonar,  radar,  text, 
and  several  others. 

In  summary,  the  open  questions  of  this  research  effort  are: 

•  Can  PADO  scale  up  to  a  large  number  of  classes? 

•  Can  PADO  learn  the  same  amount  much  faster? 

•  Can  PADO  improve  its  performance  even  as  the  number  of  classes  increase? 

•  Will  PADO  prove  to  be  a  general  signal  classification  learning  architecture? 

The  investigation  of  answers  to  these  questions  is  part  of  our  immediate  research 
agenda. 

10.  Conclusions 

This  report  began  with  the  motivation  of  the  signal-to-symbol  problem.  AI  sys¬ 
tems  need  to  reason  about  high  level  information,  but  the  world  provides  a  huge 
amount  of  noisy  perception  instead.  Any  bridge  between  these  two  realms  is  a 
significant  tool  for  AI.  In  humans,  we  depend  most  heavily  on  the  signal-to-symbol 
translation  in  our  visual  cortex.  Taking  this  as  our  cue,  we  decided  to  tackle  the 
problem  of  object  recognition.  Object  recognition’s  major  flaw  has  been  that  it 
does  not  address  unconstrained  or  “natural”  environments  very  weU.  Machine 
learning  has,  aside  from  some  small  pockets  of  success,  not  yet  delivered  in  this 
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area.  This  level  of  success  may  be  because  the  learning  architectures  that  have 
been  applied  were  not  designed  for  the  task  of  understanding  real  world  signals. 
Out  of  this  belief  that  new  architectures  must  be  found,  PADO  was  born. 

This  report  has  shown  an  application  of  PADO  on  three  related  tasks  with  two 
different  sets  of  image  data.  Both  sets  of  image  data  fall  outside  the  constraint 
boundaries  that  object  recognition  tasks  usually  require.  That  is,  translation, 
rotation,  lighting,  and  even  foreshortening  were  allowed  for  the  object  classes. 
In  addition,  the  objects  that  made  up  the  classes  were  not  simple,  uni-colored, 
geometric,  or  even  rigid  in  some  cases. 

On  these  difficult  problems,  PADO  achieved  an  object  recognition  rate  of 
about  50%.  Given  that  there  were  seven  classes,  this  is  about  3.5  times  better 
than  random  guessing  would  accomplish.  As  was  mentioned  in  the  discussion 
section,  PADO’s  performance  on  images  from  Set  T  jumps  to  about  95%  when 
the  original  color  images  are  used.  So  the  performance  described  in  this  report  is 
on  image  sets  that  have  been  made  deliberately  difficult. 

PADO  achieved  this  performance  with  no  help  from  users  or  domain  specific 
information  of  any  kind.  To  prove  this  point,  PADO  was  given  as  its  primitives 
for  accessing  images  the  simplest  possible  functions:  Pixel,  Region,  Least,  Most, 
Average,  and  Variance.  The  fact  that  these  primitives  were  coded  for  PADO 
in  half  an  hour  and  were  found  to  be  sufficient  for  a  different  image  set  (Set 
T)  supports  the  hypothesis  that  PADO  can  make  do  with  little  or  no  outside 
intervention. 

The  design  of  PADO’s  architecture  provides  for  several  exciting  features. 

•  At  any  point  during  the  learning  process,  a  program,  as  a  group  of  “signal 
understanding”  systems,  can  be  extracted  and  used  immediately  for  object 
recognition. 

•  PADO  incorporates  evolution  into  its  design,  thereby  providing  the  chance 
to  exploit  a  myriad  of  different  solutions  through  orchestration. 

•  The  orchestration  of  independent  solutions  makes  it  possible  (simple  even) 
to  run  PADO  on  a  parallel  machine  for  a  linear  speed  increase. 

•  The  architecture’s  independence  from  the  particular  examined  signal  makes 
it  viable  to  use  PADO  for  any  signal  type. 

The  future  paths  of  PADO  research  are  diverse.  In  the  near  future  we  hope 
to  see  PADO  performing  better,  with  fewer  examples,  of  more  classes,  on  harder 
images.  PADO  has  matured  so  quickly  that  this  goal  seems  much  more  realistic 
now  than  it  did  even  when  the  writing  of  this  report  began.  In  addition,  PADO 
has  been  designed  to  perform  signal  understanding  on  any  signal  type:  from  text, 
to  sonar,  to  spectrum,  to  speech.  Our  goal  is  to  have  performance  at  the  current 
highest  levels  of  understanding  in  any  signal  type  we  try. 

Researchers  today  spend  a  significant  amount  of  their  time  finding  the  right 
learning  algorithm  for  their  task,  and  then  tweaking  that  algorithm  until  it  per¬ 
forms.  There  are  simply  too  many  domains  for  this  to  lead  to  real  progress.  We 
believe  that  PADO,  as  a  domain  independent  learning  architectures,  can  have  a 
significant  impact  in  solving  the  general  signal-to-symbol  problem. 
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Appendix 

A.  Sample  Program 

This  program  was  the  best  at  recognizing  hand  held  BOOKs  in  Generation  37. 
Repeat 

(IF-THEN-ELSE  (LESS  46  (Library22  (EITHER  14  200  66)  (VARIANCE  (LESS  44  62)  (PIXEL  138  165)  193  200)  119 
99))  (Library2S  (VARIANCE  29  (EITHER  (READ  (WRITE  92  (Library32  (EITHER  240  193  (EITHER  185  233  (MOST 

226  153  45  246)))  2  (EITHER  38  (PIXEL  5  71)  175)  165)))  32  (LibrarySO  152  135  208  101))  (EITHER  116  88  85)  95) 
211  119  (READ  (Library65  169  (IF-THEN-ELSE  146  (EITHER  (IF-THEN-ELSE  68  (ADF  234  144  181  8)  223)  (WRITE 
(EITHER  216  17  225)  137)  154)  173)  (LibrarySl  170  25  (READ  (READ  190))  (EQ  162  81))  0)))  (EITHER  217  (IF- 
THEN-ELSE  200  191  9)  (EITHER  (EITHER  (NOT  (EQ  75  167))  166  (ADF  (ADF  69  129  145  6)  (IF-THEN-ELSE  201 
249  80)  (ADF  188  (MOST  207  21  (READ  (VARIANCE  254  44  102  135))  85)  216  125)  (LESS  (LEAST  114  46  78  (WRITE 
109  86))  160)))  (MULT  (SUB  (IF-THEN-ELSE  155  (LEAST  43  56  156  184)  136)  36)  165)  31))) 

Until  Time-Limit 

ADF(P1  P2  P3  P4):  (IF-THEN-ELSE  (LESS  (EQ  181  (LEAST  (IF-THEN-ELSE  193  0  199)  (IF-THEN- 
ELSE  P3  71  P2)  40  PI))  (READ  P4))  (READ  (NOT  (READ  (IF-THEN-ELSE  189  85  132))))  (LESS  (IF-THEN-ELSE 
(EQ  P3  P4)  PI  (DIV  (EQ  91  (WRITE  (EITHER  82  51  P2)  245))  P3))  (LESS  (VARIANCE  (EITHER  (READ  PI)  PI 
P2)  121  (IF-THEN-ELSE  190  PI  (LESS  (VARIANCE  P4  177  P3  P2)  P4))  26)  (READ  PI)))) 

Library22(Pl  P2  P3  P4):  (VARIANCE  (IF-THEN-ELSE  (EITHER  PI  55  (SUB  (MIN  (ADD  P4  P2) 
(WRITE  54  74))  (AVERAGE  (EQ  (EITHER  90  P3  P2)  218)  62  (NOT  P3)  (EQ  78  168))))  (MULT  (IF-THEN-ELSE  28 
PI  P2)  (LESS  (LEAST  224  P2  (READ  P2)  (LEAST  5  76  152  (REGION  142  166)))  PI))  (READ  (ADD  (VARIANCE 
222  P2  (IF-THEN-ELSE  PI  163  P4)  P2)  PI)))  PI  (NOT  (READ  (WRITE  (VARIANCE  (IF-THEN-ELSE  57  (WRITE 

227  P4)  (VARIANCE  P3  P3  P2  180))  (PIXEL  151  PI)  140  (EQ  (VARIANCE  P2  173  70  179)  102))  (REGION  251  P4)))) 
(VARIANCE  2  PI  131  P2)) 

Library25(Pl  P2  P3  P4):  (LEAST  PI  (MULT  (LESS  (MAX  P4  P3)  (EQ  33  (NOT  PI)))  93)  P2  (IF- 
THEN-ELSE  (NOT  (DIV  (IF-THEN-ELSE  (REGION  109  106)  PI  (VARIANCE  P2  93  390  P3))  (REGION  P4  (READ 
P4))))  P3  P2)) 

Library32(Pl  P2  P3  P4):  (SUB  P4  (VARIANCE  (WRITE  203  PI)  (READ  (VARIANCE  (IF-THEN- 
ELSE  157  (EITHER  77  244  (VARIANCE  226  P2  P3  PI))  146)  (VARIANCE  164  P3  217  160)  PI  (EITHER  PI  (VARIANCE 
46  (NOT  219)  P4  (WRITE  P4  P4))  PI)))  (READ  PI)  (EITHER  PI  (WRITE  (NOT  P3)  (EITHER  (IF-THEN-ELSE  110 
(DIV  245  46)  P2)  P4  229))  (VARIANCE  (LESS  222  93)  (READ  10)  (VARIANCE  P3  P4  (WRITE  (VARIANCE  46  130 
208  215)  169)  (READ  164))  (WRITE  (VARIANCE  PI  PI  P4  (EQ  P4  139))  (EITHER  P4  216  P2)))))) 
Library50(Pl  P2  P3  P4):  (MOST  102  (VARIANCE  PI  117  137  (READ  P2))  (NOT  PI)  (IF-THEN- 
ELSE  P4  P4  P4)) 

Library51(Pl  P2  P3  P4):  (EQ  PI  (VARIANCE  (AVERAGE  0  47  (MOST  PI  64  (EITHER  P3  PI  40) 
(REGION  (LEAST  P4  (EITHER  132  73  111)  226  24)  (WRITE  (MIN  P3  P4)  (LESS  PI  P3))))  (EITHER  (IF-THEN-ELSE 
PI  P4  (EITHER  162  PI  198))  (EITHER  76  26  88)  121))  (WRITE  131  (VARIANCE  PI  (IF-THEN-ELSE  211  (READ  32) 
(REGION  no  247))  235  124))  PI  (WRITE  (EITHER  (DIV  (AVERAGE  86  (WRITE  P3  P3)  215  P4)  83)  125  P4)  P3))) 
Library65(Pl  P2  P3  P4):  (REGION  (EITHER  6  113  168)  P4) 
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